Mechanistic Interpretability on Prediction of Repeated Tokens
The emergence of large-scale language models, particularly ChatGPT, has astounded many, myself included, with their remarkable linguistic abilities and versatility across tasks. Yet for all our admiration of these capabilities, researchers like me are often puzzled: even with full knowledge of a model's architecture and weight values, it remains hard to explain why a specific input sequence leads to a particular output sequence.
In this blog post, I aim to demystify GPT2-small using mechanistic interpretability in a simple setting: predicting repeated tokens.
Traditional mathematical tools for explaining machine learning models are not entirely appropriate for language models. SHAP, for instance, is effective at identifying the features that influence a prediction, but it operates at the feature level, which does not map neatly onto the token-level predictions of a language model.
Large language models (LLMs) also occupy a complex, high-dimensional space with enormous numbers of parameters and inputs. Computing SHAP values in such a space is computationally expensive and may still provide little insight into the model's decision-making process.
Mechanistic Interpretability offers a different perspective by shedding light on the underlying mechanisms or reasoning processes of a model’s predictions, rather than just pinpointing important features or inputs.
For our analysis, we will use GPT2-small and the TransformerLens library, designed specifically for interpreting GPT-2 style language models.
from transformer_lens import HookedTransformer

gpt2_small: HookedTransformer = HookedTransformer.from_pretrained("gpt2-small")
The code above loads GPT2-small. We then use it to predict tokens on a sequence produced by a function that generates a block of random tokens and repeats it once.
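The exact generator used in my notebook is not shown here; the sketch below is a minimal stand-in (the names generate_repeated_tokens, rep_tokens, and rep_cache are my own) that builds a BOS token followed by random tokens repeated once, then runs the model while caching activations.

import torch

def generate_repeated_tokens(model, seq_len: int = 50, batch: int = 1) -> torch.Tensor:
    # Hypothetical helper: a BOS token, then `seq_len` random tokens,
    # then the same random tokens again, so the second half mirrors the first.
    bos = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long)
    rand = torch.randint(0, model.cfg.d_vocab, (batch, seq_len), dtype=torch.long)
    return torch.cat([bos, rand, rand], dim=-1)

rep_tokens = generate_repeated_tokens(gpt2_small, seq_len=50).to(gpt2_small.cfg.device)
rep_logits, rep_cache = gpt2_small.run_with_cache(rep_tokens)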
Finding the Induction Head
An interesting observation when running the model on the generated sequence is that it performs significantly better on the second half of the sequence than on the first half, as measured by the log probabilities it assigns to the correct tokens.
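A simple way to quantify this, assuming the rep_logits and rep_tokens produced above, is to compute the log probability the model assigns to each correct next token and average it over each half:

def per_token_logprobs(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # Log probability the model assigns to each correct next token
    log_probs = logits.log_softmax(dim=-1)
    return log_probs[:, :-1].gather(dim=-1, index=tokens[:, 1:, None]).squeeze(-1)

lp = per_token_logprobs(rep_logits, rep_tokens)
half = lp.shape[1] // 2
print("first half mean log prob :", lp[:, :half].mean().item())
print("second half mean log prob:", lp[:, half:].mean().item())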
We identify an induction circuit: it scans the sequence for an earlier occurrence of the current token, looks at the token that followed it, and predicts that the same token will follow again, thereby continuing the repeated sequence. This circuit consists of two parts: a previous token head and an induction head.
By analyzing attention patterns and scores, we can pinpoint the specific heads in GPT2-small that act as the induction head and the previous token head.
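One standard heuristic, sketched below under the same assumptions as the code above, scores every head by the attention it places on two diagnostic diagonals of its pattern: offset -1 (the previous token) and offset -(seq_len - 1) (the token that followed the earlier occurrence of the current token, i.e. the induction offset).

# Score each head by its average attention on the two diagnostic offsets.
seq_len = 50  # must match generate_repeated_tokens above
n_layers, n_heads = gpt2_small.cfg.n_layers, gpt2_small.cfg.n_heads
induction_scores = torch.zeros(n_layers, n_heads)
prev_token_scores = torch.zeros(n_layers, n_heads)

for layer in range(n_layers):
    pattern = rep_cache["pattern", layer]  # [batch, head, query_pos, key_pos]
    induction_scores[layer] = pattern.diagonal(-(seq_len - 1), dim1=-2, dim2=-1).mean(dim=(0, -1)).cpu()
    prev_token_scores[layer] = pattern.diagonal(-1, dim1=-2, dim2=-1).mean(dim=(0, -1)).cpu()

print("strongest induction head (layer, head):", divmod(induction_scores.argmax().item(), n_heads))
print("strongest previous-token head (layer, head):", divmod(prev_token_scores.argmax().item(), n_heads))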
MLP Layer Attribution
We explore the contribution of the MLP layers in GPT2-small by ablating their outputs on the second half of the sequence and observing how the model's performance changes.
Surprisingly, the ablation does not significantly impact the model’s predictions, indicating that MLP layers may not play a crucial role in the context of repeated tokens.
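The notebook's exact ablation is not reproduced here; the sketch below zero-ablates mlp_out at every layer for positions in the second half (mean-ablation would be a reasonable alternative) and re-measures the log probabilities.

from transformer_lens import utils

def ablate_mlp_second_half(mlp_out, hook):
    # Zero the MLP output at every position in the second (repeated) half
    half_pos = mlp_out.shape[1] // 2
    mlp_out[:, half_pos:, :] = 0.0
    return mlp_out

mlp_hooks = [(utils.get_act_name("mlp_out", layer), ablate_mlp_second_half)
             for layer in range(gpt2_small.cfg.n_layers)]

with torch.no_grad():
    ablated_logits = gpt2_small.run_with_hooks(rep_tokens, fwd_hooks=mlp_hooks)

lp_ablated = per_token_logprobs(ablated_logits, rep_tokens)
print("second half mean log prob (MLPs ablated):", lp_ablated[:, half:].mean().item())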
Constructing the Induction Circuit
We manually construct an induction circuit from the two heads identified above as the induction head and the previous token head. Despite consisting of only two heads, this circuit achieves a reasonable accuracy of 0.2283.
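The construction in my notebook may differ, but one way to approximate a two-head circuit is to zero-ablate the output (z) of every other attention head and measure top-1 accuracy on the repeated half; note this keeps the embeddings and MLPs intact rather than building the circuit from scratch. The head locations below (layer 4 head 11 as the previous-token head, layer 5 head 5 as the induction head) are assumptions; substitute whichever heads your analysis identifies.

# Assumed head locations (hypothetical; substitute the heads you identified):
PREV_TOKEN_HEAD = (4, 11)  # (layer, head)
INDUCTION_HEAD = (5, 5)    # (layer, head)
KEEP = {PREV_TOKEN_HEAD, INDUCTION_HEAD}

def zero_other_heads(z, hook):
    # z: [batch, pos, head, d_head] -- zero every head not in KEEP at this layer
    keep_heads = [h for (l, h) in KEEP if l == hook.layer()]
    mask = torch.zeros(z.shape[2], dtype=torch.bool, device=z.device)
    mask[keep_heads] = True
    z[:, :, ~mask, :] = 0.0
    return z

z_hooks = [(utils.get_act_name("z", layer), zero_other_heads)
           for layer in range(gpt2_small.cfg.n_layers)]

with torch.no_grad():
    circuit_logits = gpt2_small.run_with_hooks(rep_tokens, fwd_hooks=z_hooks)

preds = circuit_logits.argmax(dim=-1)[:, :-1]
targets = rep_tokens[:, 1:]
acc = (preds[:, half:] == targets[:, half:]).float().mean()
print(f"Two-head circuit accuracy on the repeated half: {acc.item():.4f}")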
For a detailed implementation, please refer to my notebook.